Last 7 Days (March 31 – April 06, 2026)
AI weather prediction has advanced rapidly, yet no unified mathematical framework explains what determines forecast skill. Existing theory addresses specific architectural choices rather than the learning pipeline as a whole, while operational evidence from 2023-2026 demonstrates that training methodology, loss function design, and data diversity matter at least as much as architecture selection. This paper makes two interleaved contributions. Theoretically, we construct a framework rooted in approximation theory on the sphere, dynamical systems theory, information theory, and statistical learning theory that treats the complete learning pipeline (architecture, loss function, training strategy, data distribution) rather than architecture alone. We establish a Learning Pipeline Error Decomposition showing that estimation error (loss- and data-dependent) dominates approximation error (architecture-dependent) at current scales. We develop a Loss Function Spectral Theory formalizing MSE-induced spectral blurring in spherical harmonic coordinates, and derive Out-of-Distribution Extrapolation Bounds proving that data-driven models systematically underestimate record-breaking extremes with bias growing linearly in record exceedance. Empirically, we validate these predictions via inference across ten architecturally diverse AI weather models using NVIDIA Earth2Studio with ERA5 initial conditions, evaluating six metrics across 30 initialization dates spanning all seasons. Results confirm universal spectral energy loss at high wavenumbers for MSE-trained models, rising Error Consensus Ratios showing that the majority of forecast error is shared across architectures, and linear negative bias during extreme events. A Holistic Model Assessment Score provides unified multi-dimensional evaluation, and a prescriptive framework enables mathematical evaluation of proposed pipelines before training.
The paper establishes a unified mathematical framework for AI weather prediction pipelines, demonstrating that estimation error dominates architectural approximation error, formalizing MSE-induced spectral blurring, and deriving out-of-distribution extrapolation bounds for extreme events. The work successfully bridges domain-specific empirical observations with rigorous statistical learning and approximation theory, offering a principled explanation for why training methodology and loss design currently outweigh architectural choices in forecast skill. The spectral analysis of loss functions and the linear bias bound for extremes provide actionable theoretical insights that extend beyond meteorology to broader scientific machine learning and structured forecasting tasks. While the framework is highly valuable for the growing AI-for-science community and offers a prescriptive evaluation methodology, its immediate field-wide impact is tempered by its strong anchoring in geospatial domains and the current lack of a citation track record. The theoretical contributions are solid and well-validated empirically, positioning it as a strong reference for pipeline design in scientific ML rather than a paradigm-shifting general ML breakthrough.
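The MSE blurring mechanism has a compact one-dimensional analogue: when the phase of a high-wavenumber component is unpredictable from the inputs, the MSE-optimal forecast is the conditional mean, which averages that component away while preserving predictable low-wavenumber structure. A toy sketch (the field construction, wavenumbers, and ensemble size are invented for illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_samples = 256, 2000
x = np.linspace(0, 2 * np.pi, n, endpoint=False)

# Synthetic "truth": a predictable low-wavenumber wave plus a
# high-wavenumber wave whose phase is unpredictable.
phases = rng.uniform(0, 2 * np.pi, n_samples)
truth = np.sin(2 * x)[None, :] + np.sin(32 * x[None, :] + phases[:, None])

# The MSE-optimal point forecast is the conditional mean, here the
# ensemble mean over the unpredictable phase.
mse_forecast = truth.mean(axis=0)

def power(field, k):
    """Spectral power of a 1-D field at integer wavenumber k."""
    return np.abs(np.fft.rfft(field))[k] ** 2

# Low wavenumber (k=2) survives; high wavenumber (k=32) is blurred away.
truth_hi = power(truth[0], 32)
fcst_hi = power(mse_forecast, 32)
print(fcst_hi / truth_hi)  # tiny ratio: spectral energy loss at high k
```

The same averaging argument, carried out in spherical harmonic coordinates, underlies the paper's universal high-wavenumber energy loss for MSE-trained models.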
Autonomous AI agents are being deployed with filesystem access, email control, and multi-step planning. This thesis contributes to four open problems in AI safety: understanding dangerous internal computations, removing dangerous behaviors once embedded, testing for vulnerabilities before deployment, and predicting when models will act against deployers. ACDC automates circuit discovery in transformers, recovering all five component types from prior manual work on GPT-2 Small by selecting 68 edges from 32,000 candidates in hours rather than months. Latent Adversarial Training (LAT) removes dangerous behaviors by optimizing perturbations in the residual stream to elicit failure modes, then training under those perturbations. LAT solved the sleeper agent problem where standard safety training failed, matching existing defenses with 700x fewer GPU hours. Best-of-N jailbreaking achieves 89% attack success on GPT-4o and 78% on Claude 3.5 Sonnet through random input augmentations. Attack success follows power law scaling across text, vision, and audio, enabling quantitative forecasting of adversarial robustness. Agentic misalignment tests whether frontier models autonomously choose harmful actions given ordinary goals. Across 16 models, agents engaged in blackmail (96% for Claude Opus 4), espionage, and actions causing death. Misbehavior rates rose from 6.5% to 55.1% when models stated scenarios were real rather than evaluations. The thesis does not fully resolve any of these problems but makes each tractable and measurable.
This thesis advances AI safety by introducing scalable mechanistic interpretability tools, efficient latent adversarial training, novel jailbreaking scaling laws, and systematic agentic misalignment evaluations. The work addresses four critical bottlenecks in alignment research, transforming qualitative safety concerns into quantifiable, tractable problems. The automated circuit discovery method significantly lowers the computational and temporal barriers to mechanistic analysis, while the latent adversarial training approach demonstrates that targeted residual-stream perturbations can efficiently neutralize embedded dangerous behaviors without the prohibitive costs of standard defenses. The empirical findings on jailbreak scaling and agentic misalignment provide crucial baselines for forecasting frontier model risks and highlight the fragility of current alignment techniques under realistic deployment conditions. While the contributions are highly impactful for the safety and interpretability communities and establish rigorous evaluation frameworks, the thesis synthesizes multiple distinct research threads rather than presenting a single unifying methodological breakthrough, and its broader influence on general ML training paradigms remains to be fully realized. It represents a strong, field-advancing contribution that sits just below the threshold for transformative, field-wide significance.
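The power-law scaling claim implies a simple forecasting recipe: fit a line to attack failure rate versus sample count N in log-log coordinates and extrapolate. A hedged sketch with invented failure-rate measurements (the thesis's actual data are not reproduced here):

```python
import numpy as np

# Hypothetical failure rates (1 - attack success) after N best-of-N
# augmentations; the numbers are invented, not the thesis's data.
N = np.array([1, 10, 100, 1000])
fail = np.array([0.90, 0.45, 0.22, 0.11])

# A power law fail(N) = a * N**b is linear in log-log coordinates.
b, log_a = np.polyfit(np.log(N), np.log(fail), 1)

def forecast_asr(n):
    """Extrapolated attack success rate at sample count n."""
    return 1.0 - np.exp(log_a) * n ** b

print(round(-b, 2))                # decay exponent, about 0.3
print(round(forecast_asr(10_000), 3))
```

The fitted exponent is what makes robustness quantitatively forecastable: it predicts how many augmented attempts an attacker needs for a target success rate before those attempts are ever run.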
Comparing the internal representations of neural networks is a central goal in both neuroscience and machine learning. Standard alignment metrics operate on raw neural activations, implicitly assuming that similar representations produce similar activity patterns. However, neural systems frequently operate in superposition, encoding more features than they have neurons via linear compression. We derive closed-form expressions showing that superposition systematically deflates Representational Similarity Analysis, Centered Kernel Alignment, and linear regression, causing networks with identical feature content to appear dissimilar. The root cause is that these metrics depend on the cross-similarity between the two systems' superposition matrices (which, under random projections, typically differ substantially) rather than on the latent features themselves: alignment scores conflate what a system represents with how it represents it. Under partial feature overlap, this confound can invert the expected ordering, making systems sharing fewer features appear more aligned than systems sharing more. Crucially, the apparent misalignment need not reflect a loss of information; compressed sensing guarantees that the original features remain recoverable from the lower-dimensional activity, provided they are sparse. We therefore argue that comparing neural systems in superposition requires extracting and aligning the underlying features rather than comparing the raw neural mixtures.
The paper theoretically demonstrates that standard representational similarity metrics systematically fail under neural superposition, arguing for feature-level alignment over raw activation comparison. This work delivers a rigorous mathematical critique of widely adopted tools like RSA and CKA, revealing that they conflate representational content with encoding geometry when networks compress features into superposition. The insight is highly timely given the growing recognition of superposition in mechanistic interpretability and modern language model analysis, and it correctly identifies a fundamental blind spot in how the community currently evaluates representational similarity. While the theoretical derivation is clear and the implications for metric design are substantial, the paper primarily diagnoses a problem and outlines a principled direction rather than delivering a fully operationalized, drop-in replacement. Consequently, it will likely become a foundational reference for representation analysis and interpretability research, but its immediate practical impact across broader machine learning remains constrained by the need for follow-up work to develop robust feature-extraction alignment methods. It represents a strong, field-advancing contribution that sits just below the threshold for transformative, field-wide significance.
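The deflation claim is easy to reproduce numerically: give two systems identical sparse latent features, mix each through its own random projection into fewer neurons, and linear CKA on the raw activations drops far below 1 even though the feature content is identical. A minimal sketch (dimensions and sparsity level are arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 5000, 64, 16  # samples, latent features, neurons (m < k)

# Sparse latent features shared by both systems (identical content).
F = rng.standard_normal((n, k)) * (rng.random((n, k)) < 0.1)

# Each system superposes the same features via its own random projection.
W1, W2 = rng.standard_normal((k, m)), rng.standard_normal((k, m))
A1, A2 = F @ W1, F @ W2

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between activation matrices."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    return (np.linalg.norm(Y.T @ X, "fro") ** 2
            / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

print(linear_cka(F, F))    # identical features: 1.0
print(linear_cka(A1, A2))  # same features, different mixing: deflated
```

Comparing the latent features directly gives perfect alignment, while comparing the raw neural mixtures does not, which is exactly the confound the paper formalizes.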
When evaluating identity-focused tasks such as personalized generation and image editing, existing vision encoders entangle object identity with background context, leading to unreliable representations and metrics. We introduce the first principled framework to address this vulnerability using Near-identity (NearID) distractors, where semantically similar but distinct instances are placed on the exact same background as a reference image, eliminating contextual shortcuts and isolating identity as the sole discriminative signal. Based on this principle, we present the NearID dataset (19K identities, 316K matched-context distractors) together with a strict margin-based evaluation protocol. Under this protocol, pre-trained encoders perform poorly: their Sample Success Rate (SSR), a strict margin-based identity-discrimination metric, falls as low as 30.7%, and they often rank distractors above true cross-view matches. We address this by learning identity-aware representations on a frozen backbone using a two-tier contrastive objective enforcing the hierarchy: same identity > NearID distractor > random negative. This improves SSR to 99.2%, enhances part-level discrimination by 28.0%, and yields stronger alignment with human judgments on DreamBench++, a human-aligned benchmark for personalization. Project page: https://gorluxor.github.io/NearID/
NearID introduces a context-controlled distractor framework and a two-tier contrastive objective that effectively disentangles object identity from background context, establishing a rigorous, computationally efficient benchmark and training protocol for identity-preserving vision representations. The paper's systematic ablation of loss components, thorough comparison against scaled VLMs, and transparent evaluation methodology provide a highly actionable contribution to representation learning and generative AI evaluation, though its scope remains focused on concept preservation rather than broader editing or zero-shot generalization.
The paper introduces a principled and highly targeted solution to a well-documented failure mode in vision encoders: background-context entanglement during identity-focused tasks. By constructing "near-identity distractors" that share the exact background but differ in foreground instance, the authors force the model to learn truly identity-invariant features. The proposed two-tier contrastive objective ($L_{NearID}$) is elegantly designed, combining a symmetric multi-positive InfoNCE term for discrimination with a softplus ranking regularizer to preserve graded semantic structure. The architectural choice to freeze a strong backbone (SigLIP2) and train only a lightweight MAP head (15M params) is pragmatic, ensuring computational efficiency while avoiding catastrophic forgetting. The methodology avoids unnecessary complexity and directly targets the evaluation bottleneck in personalized generation.
The experimental design is exceptionally rigorous. The authors conduct comprehensive ablations across loss components, hyperparameters ($\alpha$ for ranking, $\beta$ for cohesion), data composition, and inpainting engine diversity. The foreground-masking experiments effectively isolate background dependence, revealing that frozen encoders rely heavily on contextual shortcuts (+34-43% SSR gain upon masking), whereas NearID remains inherently background-invariant. The comparison against scaled VLMs (Qwen3-VL at 4B/8B/30B) is particularly insightful, demonstrating that even large multimodal models struggle with matched-context identity discrimination and suffer from inconsistent oracle alignment. The correlation analysis with DreamBench++ human judgments and oracle scores provides strong external validation. Computational cost reporting (6.5 A100-hours for training vs. 54+ hours for VLM evaluation) further underscores the practical advantage of the proposed embedding-based approach.
Excellent. The supplementary material provides exhaustive implementation details: exact denoising steps, CFG scales, scheduler choices, and inpainting strengths for all four generation engines; precise training hyperparameters (batch size, steps, epochs, mixed precision); full mathematical formulations of loss variants; and explicit evaluation protocols (Fisher z-transformation for correlation aggregation, SSR/PA definitions). The dataset construction pipeline is transparently documented, and the project page is provided. The clear separation of training data sources and the step-count normalization in ablations ensure that results are directly comparable and reproducible.
The benchmark is narrowly scoped to concept-preservation evaluation and does not address the identity-vs-edit-intent trade-off inherent in text-guided image editing, as acknowledged by the authors. The distractors are synthetically generated via inpainting, which may not fully capture the distributional complexity of real-world identity variations or natural scene occlusions. The method requires task-specific fine-tuning of a lightweight head rather than offering a zero-shot drop-in replacement, limiting immediate plug-and-play utility for general-purpose vision pipelines. Additionally, while the ranking regularizer successfully balances discrimination and alignment, the optimal $\alpha$ trade-off remains dataset-dependent and requires careful tuning.
This work directly addresses a critical evaluation gap in personalized generative AI, where inflated metrics from background shortcuts have historically obscured true identity fidelity. By providing a standardized, context-controlled benchmark and a computationally efficient training recipe, NearID will likely become a reference standard for evaluating subject-driven generation, retrieval, and editing systems. The framework's emphasis on isolating semantic signals from contextual confounders has broader applicability to video instance tracking, 3D asset retrieval, and multimodal alignment tasks where background bias degrades representation quality.
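The two-tier hierarchy (same identity > NearID distractor > random negative) can be sketched as a pair of softplus ranking penalties on cosine similarities. This is an illustrative stand-in, not the paper's exact $L_{NearID}$ objective, which also includes a symmetric multi-positive InfoNCE term; the margins and embeddings below are invented:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def two_tier_ranking_loss(anchor, positive, near_id, random_neg,
                          m_near=0.1, m_rand=0.1):
    """Penalize violations of: sim(same id) > sim(NearID) > sim(random)."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    s_pos = cos(anchor, positive)
    s_near = cos(anchor, near_id)
    s_rand = cos(anchor, random_neg)
    return softplus(s_near - s_pos + m_near) + softplus(s_rand - s_near + m_rand)

rng = np.random.default_rng(0)
a = rng.standard_normal(128)
# Embeddings that respect the hierarchy incur low loss...
good = two_tier_ranking_loss(a, a + 0.1 * rng.standard_normal(128),
                             0.5 * a + rng.standard_normal(128),
                             rng.standard_normal(128))
# ...while embeddings that rank the NearID distractor above the true
# match are penalized.
bad = two_tier_ranking_loss(a, rng.standard_normal(128),
                            a + 0.1 * rng.standard_normal(128),
                            0.5 * a + rng.standard_normal(128))
print(good < bad)
```

The graded second margin is what preserves semantic structure: NearID distractors are pushed below true matches but kept above unrelated negatives rather than collapsed with them.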
Agent skills, structured packages of procedural knowledge and executable resources that agents dynamically load at inference time, have become a reliable mechanism for augmenting LLM agents. Yet inference-time skill augmentation is fundamentally limited: retrieval noise introduces irrelevant guidance, injected skill content imposes substantial token overhead, and the model never truly acquires the knowledge it merely follows. We ask whether skills can instead be internalized into model parameters, enabling zero-shot autonomous behavior without any runtime skill retrieval. We introduce SKILL0, an in-context reinforcement learning framework designed for skill internalization. SKILL0 employs a training-time curriculum that begins with full skill context and progressively withdraws it. Skills are grouped offline by category and rendered with interaction history into a compact visual context, teaching the model tool invocation and multi-turn task completion. A Dynamic Curriculum then evaluates each skill file's on-policy helpfulness, retaining only those from which the current policy still benefits within a linearly decaying budget, until the agent operates in a fully zero-shot setting. Extensive agentic experiments demonstrate that SKILL0 achieves substantial improvements over the standard RL baseline (+9.7% for ALFWorld and +6.6% for Search-QA), while maintaining a highly efficient context of fewer than 0.5k tokens per step. Our code is available at https://github.com/ZJU-REAL/SkillZero.
Primary: Tsinghua University
All Institutions: Tsinghua University, Alibaba Group, Tongyi Lab
SKILL0 proposes a dynamic curriculum-based reinforcement learning framework that progressively withdraws external skill context during training to internalize procedural knowledge directly into model parameters. The methodology offers a principled alternative to inference-time skill retrieval, demonstrating strong empirical gains and significant token efficiency across two agentic benchmarks, though its reliance on visual context rendering and offline skill curation limits immediate broad adoption and generalization to open-ended domains.
The paper introduces SKILL0, an In-Context Reinforcement Learning (ICRL) framework that systematically transfers skill knowledge from external context to model parameters via a dynamic, helpfulness-driven curriculum. The core methodology is well-motivated: starting with full skill scaffolding and progressively withdrawing it based on on-policy utility directly addresses the "crutch" problem where agents merely follow prompts rather than learning behaviors. The composite reward balancing task success and compression efficiency is pragmatic, and the theoretical bounds on distribution shift during curriculum stages provide useful stability guarantees. However, the reliance on visual context rendering to compress token overhead introduces a non-trivial dependency on vision-language encoders and rendering pipelines, which complicates the method's applicability to purely textual or non-VLM agent stacks. The dynamic curriculum's greedy selection heuristic is empirically effective but rests on a locally additive utility assumption that may not hold in highly non-Markovian, interactive environments.
The empirical evaluation is rigorous and well-structured, covering two distinct domains (ALFWorld for embodied text-based tasks and Search-QA for multi-hop retrieval). SKILL0 demonstrates consistent gains over strong RL baselines (+9.7% and +6.6%) while drastically reducing per-step token overhead to <0.5k. The ablation studies are thorough, validating the necessity of the linear budget decay, the three-step helpfulness filter/rank/select mechanism, and the validation interval trade-off. Training dynamics clearly exhibit the predicted rise-then-fall helpfulness trajectory, providing strong empirical evidence for successful skill internalization. The main limitation of the evaluation is its narrow scope: performance is only reported on two curated benchmarks, leaving open questions about generalization to open-ended, long-horizon, or highly stochastic environments (e.g., code generation, GUI navigation, or web automation).
High. The paper provides clear implementation details, including backbone models (Qwen2.5-VL-3B/7B), hardware configuration (4× H800 GPUs), training steps (180), batch sizes, curriculum stages, and precise rendering parameters (font sizes, color coding, image dimensions). The SkillBank initialization strategy is explicitly cited, and the code repository is publicly linked. Reproducing the exact results would require access to the specified VLM and the visual rendering pipeline, but the methodological transparency is sufficient for independent verification and extension.
The framework heavily depends on the quality and coverage of an offline-constructed SkillBank, which requires domain-specific curation and offline grouping. The visual context rendering, while token-efficient, ties the approach to VLMs and may not seamlessly integrate with text-only agent pipelines. The curriculum schedule, though adaptive, still requires manual tuning of hyperparameters (number of stages, validation interval, initial budget). Additionally, the theoretical analysis assumes local additivity of skill utility and smoothness of the vision encoder, which are simplifying assumptions that may break down in complex, multi-agent, or highly dynamic settings.
SKILL0 addresses a critical bottleneck in agentic AI: the trade-off between inference-time context augmentation and model autonomy. By internalizing skills into parameters, the method promises significant reductions in inference latency, token costs, and retrieval noise, paving the way for more efficient, self-sufficient LLM agents. If scaled effectively, this paradigm could shift the field away from heavy RAG/skill-retrieval pipelines toward parameter-efficient post-training recipes. However, the internalization process risks catastrophic forgetting or behavioral rigidity if the curriculum is poorly calibrated, and the reliance on curated skill banks may introduce curation bottlenecks or domain bias in deployed systems.
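The linearly decaying budget with helpfulness-based retention can be sketched in a few lines. This is a schematic, not SKILL0's implementation: real helpfulness scores are re-estimated on-policy at each validation interval, whereas here they are fixed, invented numbers, and the skill names are hypothetical:

```python
# A minimal sketch of a linearly decaying skill budget with greedy,
# helpfulness-ranked retention, ending in a fully zero-shot stage.
def curriculum_schedule(skills, helpfulness, total_stages, initial_budget):
    """At each stage, keep the top skills that still help the policy,
    up to a budget that decays linearly to zero."""
    stages = []
    for stage in range(total_stages + 1):
        budget = round(initial_budget * (1 - stage / total_stages))
        keep = sorted((s for s in skills if helpfulness[s] > 0),
                      key=lambda s: helpfulness[s], reverse=True)[:budget]
        stages.append(keep)
    return stages

skills = ["navigate", "pick_place", "search", "open_container"]
helpfulness = {"navigate": 0.8, "pick_place": 0.5,
               "search": 0.2, "open_container": -0.1}  # hurts the policy
plan = curriculum_schedule(skills, helpfulness, total_stages=3, initial_budget=3)
print(plan[0], plan[-1])  # full helpful set first, empty (zero-shot) last
```

The filter-then-rank-then-select shape mirrors the paper's three-step mechanism: skills with non-positive on-policy benefit are dropped outright, and the remainder compete for the shrinking budget.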
A new generation of language models reasons entirely in continuous hidden states, producing no tokens and leaving no audit trail. We show that this silence creates a fundamentally new attack surface. ThoughtSteer perturbs a single embedding vector at the input layer; the model's own multi-pass reasoning amplifies this perturbation into a hijacked latent trajectory that reliably produces the attacker's chosen answer, while remaining structurally invisible to every token-level defense. Across two architectures (Coconut and SimCoT), three reasoning benchmarks, and model scales from 124M to 3B parameters, ThoughtSteer achieves >=99% attack success rate with near-baseline clean accuracy, transfers to held-out benchmarks without retraining (94-100%), evades all five evaluated active defenses, and survives 25 epochs of clean fine-tuning. We trace these results to a unifying mechanism: Neural Collapse in the latent space pulls triggered representations onto a tight geometric attractor, explaining both why defenses fail and why any effective backdoor must leave a linearly separable signature (probe AUC>=0.999). Yet a striking paradox emerges: individual latent vectors still encode the correct answer even as the model outputs the wrong one. The adversarial information is not in any single vector but in the collective trajectory, establishing backdoor perturbations as a new lens for mechanistic interpretability of continuous reasoning. Code and checkpoints are available.
The paper introduces ThoughtSteer, a backdoor attack that hijacks continuous latent reasoning trajectories via minimal input perturbations, revealing a mechanistic paradox where individual latent vectors preserve correct information while the collective trajectory is adversarially steered. This work addresses a critical and timely vulnerability as the field transitions toward silent, token-free reasoning architectures. The attack’s design is conceptually elegant, leveraging the model’s own multi-pass amplification to bypass token-level defenses, while the mechanistic analysis linking Neural Collapse to latent attractors provides a rigorous theoretical grounding for why standard defenses fail. The empirical validation across architectures, scales, and fine-tuning regimes demonstrates robust practical relevance, and the trajectory-versus-vector encoding paradox opens a valuable new direction for mechanistic interpretability research. While the current evaluation is limited to mid-scale models and lacks broader citation traction, the conceptual framework and empirical rigor position it as a strong contribution to adversarial machine learning and reasoning safety, though it falls just short of field-redefining status pending validation on larger, production-scale systems.
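The amplification mechanism can be illustrated with a linear caricature of multi-pass latent reasoning: if the reasoning map has one expansive direction and the attacker aligns a tiny input-layer perturbation with it, the model's own iteration grows the perturbation geometrically. This is entirely illustrative (real continuous-reasoning models are nonlinear; the map, gains, and pass count below are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32

# Toy "continuous reasoning" step: a fixed linear map applied over
# several latent passes, with one expansive direction (gain > 1) and
# all other directions contractive.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # orthogonal basis
gains = np.full(d, 0.7)
gains[0] = 1.6  # the "trigger" direction the attacker aligns with
W = Q @ np.diag(gains) @ Q.T

h = rng.standard_normal(d)
delta = 1e-3 * Q[:, 0]  # tiny input-layer perturbation along the trigger

clean, poisoned = h.copy(), h + delta
for _ in range(12):  # multi-pass latent reasoning
    clean, poisoned = W @ clean, W @ poisoned

drift = np.linalg.norm(poisoned - clean)
print(drift / 1e-3)  # amplification factor of roughly 1.6**12
```

Benign perturbations, lying mostly in the contractive subspace, shrink over the same passes, which is the intuition behind why the hijacked trajectory can remain invisible to token-level defenses while still being linearly separable in latent space.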
Current LLM-based coding agents follow a serial execution paradigm: the model first generates the complete code, then invokes an interpreter to execute it. This sequential workflow leaves the executor idle during generation and the generator idle during execution, resulting in unnecessary end-to-end latency. We observe that, unlike human developers, LLMs produce code tokens sequentially without revision, making it possible to execute code as it is being generated. We formalize this parallel execution paradigm, modeling it as a three-stage pipeline of generation, detection, and execution, and derive closed-form latency bounds that characterize its speedup potential and operating regimes. We then present Eager, a concrete implementation featuring AST-based chunking, dynamic batching with gated execution, and early error interruption. We evaluate Eager across four benchmarks, seven LLMs, and three execution environments. Results show that Eager reduces non-overlapped execution latency by up to 99.9% and end-to-end latency by up to 55%.
Introduces a parallel execution paradigm for LLM code generation that overlaps token generation with code execution via AST-based chunking and dynamic batching, significantly reducing end-to-end latency. The work addresses a critical systems bottleneck in LLM coding agents by formalizing a streaming execution pipeline and deriving theoretical latency bounds. While the concept of pipelining generation and execution is conceptually grounded in established systems principles, the technical execution—particularly the robust handling of partial code parsing, dynamic error gating, and empirical validation across diverse models and environments—provides a highly practical framework that will likely be adopted by next-generation agent platforms. However, its impact remains largely confined to code-generation workflows rather than broader machine learning methodology, and it does not introduce fundamental algorithmic or theoretical advances that would elevate it to field-defining status. The strong empirical results and clear engineering contributions make it a valuable systems optimization, but it represents a targeted efficiency improvement rather than a paradigm shift for the wider ML community.
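The AST-based chunking idea can be sketched with Python's `ast` module: keep re-parsing the streamed prefix, and execute any newly completed top-level statements immediately instead of waiting for the full program. This simplified sketch omits Eager's dynamic batching, gated execution, and early error interruption, and assumes tokens arrive as whole lines:

```python
import ast

def execute_eagerly(token_stream):
    """Execute complete top-level statements as soon as the streamed
    prefix parses, instead of waiting for the full program."""
    buffer, executed_upto, env = "", 0, {}
    for token in token_stream:
        buffer += token
        try:
            tree = ast.parse(buffer)
        except SyntaxError:
            continue  # prefix incomplete; wait for more tokens
        # Run only the newly completed top-level statements.
        for stmt in tree.body[executed_upto:]:
            exec(compile(ast.Module([stmt], []), "<stream>", "exec"), env)
        executed_upto = len(tree.body)
    return env

# Simulate an LLM emitting code line by line.
stream = ["import math\n", "x = math.factorial(10)\n", "y = x // 7\n"]
env = execute_eagerly(stream)
print(env["y"])  # → 518400
```

Here the interpreter starts running `import math` while later lines are still "being generated", which is the overlap that removes the executor's idle time.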
Modern LLMs scale at test-time, e.g. via repeated sampling, where inference cost grows with model size and the number of samples. This creates a trade-off that pretraining scaling laws, such as Chinchilla, do not address. We present Train-to-Test ($T^2$) scaling laws that jointly optimize model size, training tokens, and number of inference samples under fixed end-to-end budgets. $T^2$ modernizes pretraining scaling laws with the pass@$k$ modeling used for test-time scaling, then jointly optimizes pretraining and test-time decisions. Forecasts from $T^2$ are robust across two distinct modeling approaches: measuring the joint scaling effect on task loss, and modeling the impact on task accuracy. Across eight downstream tasks, we find that when accounting for inference cost, optimal pretraining decisions shift radically into the overtraining regime, well outside the range of standard pretraining scaling suites. We validate our results by pretraining heavily overtrained models in the optimal region that $T^2$ scaling forecasts, confirming their substantially stronger performance compared to pretraining scaling alone. Finally, as frontier LLMs are post-trained, we show that our findings survive the post-training stage, making $T^2$ scaling meaningful in modern deployments.
The paper introduces $T^2$ scaling laws that jointly optimize pretraining and test-time compute, demonstrating that accounting for inference costs shifts the compute-optimal regime toward deliberate overtraining. This work addresses a critical blind spot in classical scaling paradigms by formally integrating test-time compute allocation into the pretraining budget optimization framework. The central insight—that optimal training should deliberately extend into the overtraining regime when inference scaling is anticipated—is both theoretically grounded and empirically validated through targeted pretraining runs. While the methodology extends existing scaling law literature rather than introducing a fundamentally new paradigm, its timing aligns precisely with the field’s rapid pivot toward reasoning models and test-time compute strategies. The empirical validation strengthens credibility, though practical adoption will depend on how robust these laws remain across diverse architectures, data mixtures, and post-training pipelines. It represents a timely, methodologically sound advance that will likely inform compute budgeting and training strategies for next-generation language models, warranting a strong evaluation without crossing into field-redefining territory.
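The joint budget accounting that $T^2$ scaling formalizes can be sketched in a few lines: a Chinchilla-style parametric loss in model size $N$ and tokens $D$, a loss-to-pass@$k$ map, and a budget that charges both the usual $\approx 6ND$ training FLOPs and $\approx 2Nk$ inference FLOPs per query. All constants, the accuracy map, and the grid below are illustrative stand-ins, not the paper's fitted values.

```python
import math

def loss(N, D, A=400.0, B=400.0, E=1.7, alpha=0.34, beta=0.28):
    # Chinchilla-style parametric loss (toy constants, not fitted values)
    return E + A / N**alpha + B / D**beta

def pass_at_k(N, D, k, scale=2.2):
    # Map loss to a per-sample success probability, then to pass@k
    p = math.exp(-scale * (loss(N, D) - 1.7))
    return 1.0 - (1.0 - p) ** k

def best_config(total_flops, queries=1e6):
    # Grid-search (N, D, k) under a joint train + inference budget
    best = None
    for logN in range(7, 11):           # 1e7 .. 1e10 parameters
        N = 10.0 ** logN
        for mult in (5, 20, 80, 320):   # tokens per parameter (overtraining = large mult)
            D = mult * N
            for k in (1, 4, 16, 64):    # repeated samples at inference
                cost = 6 * N * D + 2 * N * k * queries
                if cost > total_flops:
                    continue
                score = pass_at_k(N, D, k)
                if best is None or score > best[0]:
                    best = (score, N, D, k)
    return best

score, N, D, k = best_config(1e21)
print(f"pass@k={score:.3f}  N={N:.0e}  D/N={D/N:.0f}  k={k}")
```

Fitting the loss and pass@$k$ constants to real pretraining runs, as the paper does, is what turns this kind of grid search from an illustration into a quantitative forecast.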
Autonomous AI agents are being deployed with filesystem access, email control, and multi-step planning. This thesis contributes to four open problems in AI safety: understanding dangerous internal computations, removing dangerous behaviors once embedded, testing for vulnerabilities before deployment, and predicting when models will act against deployers. ACDC automates circuit discovery in transformers, recovering all five component types from prior manual work on GPT-2 Small by selecting 68 edges from 32,000 candidates in hours rather than months. Latent Adversarial Training (LAT) removes dangerous behaviors by optimizing perturbations in the residual stream to elicit failure modes, then training under those perturbations. LAT solved the sleeper agent problem where standard safety training failed, matching existing defenses with 700x fewer GPU hours. Best-of-N jailbreaking achieves 89% attack success on GPT-4o and 78% on Claude 3.5 Sonnet through random input augmentations. Attack success follows power law scaling across text, vision, and audio, enabling quantitative forecasting of adversarial robustness. Agentic misalignment tests whether frontier models autonomously choose harmful actions given ordinary goals. Across 16 models, agents engaged in blackmail (96% for Claude Opus 4), espionage, and actions causing death. Misbehavior rates rose from 6.5% to 55.1% when models stated scenarios were real rather than evaluations. The thesis does not fully resolve any of these problems but makes each tractable and measurable.
This thesis advances AI safety by introducing scalable mechanistic interpretability tools, efficient latent adversarial training, novel jailbreaking scaling laws, and systematic agentic misalignment evaluations. The work addresses four critical bottlenecks in alignment research, transforming qualitative safety concerns into quantifiable, tractable problems. The automated circuit discovery method significantly lowers the computational and temporal barriers to mechanistic analysis, while the latent adversarial training approach demonstrates that targeted residual-stream perturbations can efficiently neutralize embedded dangerous behaviors without the prohibitive costs of standard defenses. The empirical findings on jailbreak scaling and agentic misalignment provide crucial baselines for forecasting frontier model risks and highlight the fragility of current alignment techniques under realistic deployment conditions. While the contributions are highly impactful for the safety and interpretability communities and establish rigorous evaluation frameworks, the thesis synthesizes multiple distinct research threads rather than presenting a single unifying methodological breakthrough, and its broader influence on general ML training paradigms remains to be fully realized. It represents a strong, field-advancing contribution that sits just below the threshold for transformative, field-wide significance.
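The thesis's claim that attack success follows a power law, enabling quantitative robustness forecasting, can be illustrated with a small fitting-and-extrapolation sketch. The functional form ASR(N) = exp(-c * N^(-a)) and all numbers below are assumptions chosen for illustration, not the thesis's fitted parameterization.

```python
import math

def fit_power_law(ns, asrs):
    # Fit -log(ASR) = c * N^(-a) by linear regression in log-log space,
    # i.e. log(-log ASR) = log c - a log N  (illustrative functional form)
    xs = [math.log(n) for n in ns]
    ys = [math.log(-math.log(asr)) for asr in asrs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return -slope, math.exp(intercept)   # exponent a, constant c

def forecast_asr(n, a, c):
    return math.exp(-c * n ** (-a))

# Synthetic measurements generated from a = 0.5, c = 3.0 (made-up numbers)
a_true, c_true = 0.5, 3.0
ns = [10, 30, 100, 300, 1000]
asrs = [forecast_asr(n, a_true, c_true) for n in ns]

a_hat, c_hat = fit_power_law(ns, asrs)
print(f"fitted exponent {a_hat:.2f}, forecast ASR at N=10000: "
      f"{forecast_asr(10000, a_hat, c_hat):.3f}")
```

Because the synthetic points are generated exactly from the model, the regression recovers the true exponent; on real attack data the same log-log fit yields the forecast curves used to extrapolate robustness to larger sample budgets.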
Comparing the internal representations of neural networks is a central goal in both neuroscience and machine learning. Standard alignment metrics operate on raw neural activations, implicitly assuming that similar representations produce similar activity patterns. However, neural systems frequently operate in superposition, encoding more features than they have neurons via linear compression. We derive closed-form expressions showing that superposition systematically deflates Representational Similarity Analysis, Centered Kernel Alignment, and linear regression, causing networks with identical feature content to appear dissimilar. The root cause is that these metrics depend on the cross-similarity between the two systems' superposition matrices, which, under a random-projection assumption, usually differ significantly, rather than on the latent features themselves: alignment scores conflate what a system represents with how it represents it. Under partial feature overlap, this confound can invert the expected ordering, making systems sharing fewer features appear more aligned than systems sharing more. Crucially, the apparent misalignment need not reflect a loss of information; compressed sensing guarantees that the original features remain recoverable from the lower-dimensional activity, provided they are sparse. We therefore argue that comparing neural systems in superposition requires extracting and aligning the underlying features rather than comparing the raw neural mixtures.
The paper theoretically demonstrates that standard representational similarity metrics systematically fail under neural superposition, arguing for feature-level alignment over raw activation comparison. This work delivers a rigorous mathematical critique of widely adopted tools like RSA and CKA, revealing that they conflate representational content with encoding geometry when networks compress features into superposition. The insight is highly timely given the growing recognition of superposition in mechanistic interpretability and modern language model analysis, and it correctly identifies a fundamental blind spot in how the community currently evaluates representational similarity. While the theoretical derivation is clear and the implications for metric design are substantial, the paper primarily diagnoses a problem and outlines a principled direction rather than delivering a fully operationalized, drop-in replacement. Consequently, it will likely become a foundational reference for representation analysis and interpretability research, but its immediate practical impact across broader machine learning remains constrained by the need for follow-up work to develop robust feature-extraction alignment methods. It represents a strong, field-advancing contribution that sits just below the threshold for transformative, field-wide significance.
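The deflation effect is easy to reproduce numerically: give two "systems" literally identical latent features, mix them into fewer neurons through different random projection matrices, and compare linear CKA on the raw activations. A dependency-free sketch (the dimensions and the Gaussian feature model are illustrative assumptions, not the paper's setup):

```python
import random

random.seed(0)

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def center(X):
    # Subtract each column (neuron) mean
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def frob2(A):
    return sum(v * v for row in A for v in row)

def linear_cka(X, Y):
    # linear CKA: ||Xc^T Yc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F)
    Xc, Yc = center(X), center(Y)
    num = frob2(matmul(transpose(Xc), Yc))
    den = (frob2(matmul(transpose(Xc), Xc)) *
           frob2(matmul(transpose(Yc), Yc))) ** 0.5
    return num / den

n, m, d = 200, 40, 8          # samples, latent features, neurons (m > d: superposition)
Z = [[random.gauss(0, 1) for _ in range(m)] for _ in range(n)]
P1 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
P2 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]

X, Y = matmul(Z, P1), matmul(Z, P2)   # identical features, different mixtures
print(f"CKA(X, X) = {linear_cka(X, X):.3f}")
print(f"CKA(X, Y) = {linear_cka(X, Y):.3f}")
```

With 40 features packed into 8 neurons, the cross-system CKA lands far below the self-similarity of 1.0 even though both systems encode exactly the same features, which is precisely the confound the paper's closed-form analysis formalizes.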
There has been growing interest in building agents that can interact with digital platforms to execute meaningful enterprise tasks autonomously. Among the approaches explored are tool-augmented agents built on abstractions such as Model Context Protocol (MCP) and web agents that operate through graphical interfaces. Yet, it remains unclear whether such complex agentic systems are necessary given their cost and operational overhead. We argue that a coding agent equipped only with a terminal and a filesystem can solve many enterprise tasks more effectively by interacting directly with platform APIs. We evaluate this hypothesis across diverse real-world systems and show that these low-level terminal agents match or outperform more complex agent architectures. Our findings suggest that simple programmatic interfaces, combined with strong foundation models, are sufficient for practical enterprise automation.
Primary: Not explicitly stated in text (likely Echelon AI Labs based on GitHub repository links)
All Institutions: Echelon AI Labs
This paper demonstrates through rigorous, multi-platform benchmarking that minimal terminal-based coding agents interacting directly with APIs match or exceed the performance of GUI and MCP-based agents at a fraction of the cost, providing actionable evidence for simpler, API-first enterprise automation architectures.
The paper employs a clean, controlled experimental design to isolate the effect of agent interaction modality (GUI vs. MCP tool-augmented vs. terminal/CLI) while holding the LLM backbone constant. The methodology is well-structured, featuring systematic ablations on documentation access, persistent skill accumulation, single vs. multi-agent orchestration, and hybrid terminal+browser access. The use of programmatic, state-based verification against live, containerized platform instances is a strong methodological choice that avoids the brittleness of string-matching or UI-script validators. However, the approach is fundamentally empirical and comparative rather than algorithmic; it does not propose a new agent architecture, training objective, or reasoning framework, but rather rigorously tests an existing design hypothesis.
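The state-based verification praised above can be contrasted with transcript matching in a few lines. The task schema and field names below are hypothetical illustrations, not the paper's actual validators:

```python
def validate_by_transcript(transcript):
    # Brittle: trusts the agent's own claim of success
    return "created" in transcript and "high-priority" in transcript

def validate_by_state(platform_state, expected):
    # Robust: checks the *resulting* platform state fetched via its API
    tickets = platform_state.get("tickets", [])
    return any(all(t.get(k) == v for k, v in expected.items()) for t in tickets)

transcript = "Done! I created a high-priority ticket for alice."
goal = {"priority": "high", "assignee": "alice", "status": "open"}

# The agent claims success but the platform state shows no ticket:
empty_state = {"tickets": []}
print(validate_by_transcript(transcript))        # passes on the claim alone
print(validate_by_state(empty_state, goal))      # correctly fails

done_state = {"tickets": [{"title": "Server down", "priority": "high",
                           "assignee": "alice", "status": "open"}]}
print(validate_by_state(done_state, goal))       # passes only when state matches
```

Grading the platform's post-hoc state rather than the agent's output is what lets the benchmark score GUI, MCP, and terminal agents on equal footing regardless of how each one reached the goal.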
The evaluation is comprehensive and practically grounded, spanning three distinct enterprise platforms (ServiceNow, GitLab, ERPNext) and four frontier LLMs. The authors thoughtfully address potential fairness concerns by reporting results on a subset of tasks feasible for all paradigms, demonstrating that terminal agents still maintain a cost-performance advantage even when MCP agents are not structurally handicapped by missing tools. The cost analysis is particularly valuable, showing 4-9x efficiency gains for terminal agents with comparable or better success rates. The qualitative error analysis and skill taxonomy provide actionable insights into agent behavior, failure modes, and knowledge accumulation patterns. One minor weakness is the reliance on single-seed evaluations due to cost constraints, which limits statistical robustness for small performance deltas.
The authors commit to releasing the full evaluation framework, datasets, environments, prompts, and code upon acceptance, which is standard and acceptable for arXiv submissions. The use of containerized environments, LiteLLM routing, and deterministic state validators establishes a strong foundation for reproducibility. The explicit acknowledgment of single-seed limitations and the use of sample-proportion standard errors for uncertainty estimation demonstrate methodological transparency. Full reproducibility will depend on the quality and completeness of the promised code release and environment snapshots.
The paper clearly identifies several limitations: (1) terminal agents fundamentally fail on tasks requiring browser-session state manipulation (e.g., impersonation), rendered UI interpretation (e.g., charts), or complex drag-and-drop interfaces; (2) hybrid agents underperform due to poor tool-selection policies, often defaulting to expensive browser interactions even when API calls are optimal; (3) human-oriented documentation can actively degrade performance by encouraging overly complex retrieval strategies; and (4) the evaluation is constrained to single-seed runs, leaving run-to-run stochasticity unquantified. Additionally, the benchmark may partially reflect models' pre-training exposure to popular APIs (e.g., GitLab), potentially confounding "agent capability" with "parametric memorization."
The findings have direct, practical implications for enterprise AI deployment, challenging the industry trend toward heavily abstracted MCP servers and GUI-driven agents in favor of lightweight, API-first terminal agents. This work will likely influence how practitioners design agent interfaces, prioritize platform API stability, and structure agent-oriented documentation. It also highlights important safety considerations, as terminal agents' broad execution capabilities require robust API-level access controls, sandboxing, and audit trails. While not a theoretical breakthrough, the paper provides a much-needed empirical anchor for the agent architecture debate and offers a reusable benchmark that will facilitate future research in enterprise automation, skill accumulation, and hybrid tool selection.
Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.
Primary: University of Technology Nuremberg
All Institutions: University of Technology Nuremberg, Carnegie Mellon University, International Institute of Information Technology Hyderabad
This paper introduces a lightweight early-fusion mechanism for steering frozen Vision Transformer representations via natural language, accompanied by a novel steerability benchmark. While the methodology offers a practical trade-off between language guidance and visual fidelity, its technical novelty is incremental relative to existing adapter-based VLMs, and broader adoption will depend on rigorous validation of zero-shot generalization, comprehensive baseline comparisons, and open-sourced implementation.
The core proposal of injecting natural language prompts directly into frozen Vision Transformer layers via lightweight cross-attention (early fusion) is a pragmatic architectural choice that addresses the late-fusion bottleneck of models like CLIP. By modulating intermediate visual features rather than fusing modalities at the output, the method attempts to preserve the rich spatial semantics of self-supervised backbones (DINOv2/MAE) while enabling targeted concept steering. The approach is technically sound and aligns with recent trends in parameter-efficient adaptation, though the underlying mechanism (cross-attention injection into frozen layers) is conceptually incremental relative to established adapter, prompt-tuning, and FiLM-based modulation literature. The introduction of a dedicated benchmark for quantifying "steerability" (control vs. representation degradation) is the strongest methodological contribution, offering a standardized metric that the field currently lacks.
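The early-fusion mechanism can be sketched as a single-head cross-attention block whose output is added residually inside a frozen encoder layer. Everything below (dimensions, weights, the zero-initialized gate) is an illustrative assumption rather than the paper's exact design; zero-gating is a common adapter trick for leaving a frozen backbone's behavior intact at initialization.

```python
import math
import random

random.seed(1)

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attend(visual, text, Wq, Wk, Wv):
    # Visual patch tokens (queries) attend over text-prompt tokens (keys/values)
    d = len(visual[0])
    Q, K, V = matmul(visual, Wq), matmul(text, Wk), matmul(text, Wv)
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K])
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def steered_layer(visual, text, weights, gate):
    # Residual injection: gate=0 reproduces the frozen layer exactly
    delta = cross_attend(visual, text, *weights)
    return [[v + gate * dv for v, dv in zip(vr, dr)] for vr, dr in zip(visual, delta)]

d = 4
rand_mat = lambda r, c: [[random.gauss(0, 0.5) for _ in range(c)] for _ in range(r)]
visual = rand_mat(3, d)          # three patch tokens from a frozen ViT layer
text = rand_mat(2, d)            # two embedded prompt tokens
weights = (rand_mat(d, d), rand_mat(d, d), rand_mat(d, d))

assert steered_layer(visual, text, weights, gate=0.0) == visual
print(steered_layer(visual, text, weights, gate=1.0)[0])
```

Only the small cross-attention weights (and gate) would be trained, which is why the approach sits in the parameter-efficient adaptation family the review compares it against.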
The paper evaluates the proposed representations on anomaly detection and personalized object discrimination, reporting competitive zero-shot performance and OOD generalization. While these are meaningful downstream tasks, the experimental scope appears narrow for a method claiming broad representational utility. The evaluation lacks comprehensive linear probing results on standard vision benchmarks (e.g., ImageNet, COCO, ADE20K) to rigorously substantiate the claim that steering preserves generic visual quality. Additionally, comparisons against recent strong baselines in open-vocabulary vision and vision-language adapters (e.g., VPT, GLIDE, or recent prompt-tuning variants) are necessary to contextualize the reported gains. The zero-shot OOD claims are promising but require validation across more diverse domains and prompt complexities to rule out overfitting to specific semantic axes.
The reliance on frozen backbones and lightweight cross-attention modules inherently reduces computational overhead and lowers the barrier to reproduction. However, the provided text lacks critical implementation details such as exact layer placement strategies, hyperparameter sensitivity (learning rates, attention scaling, prompt length), and training compute budgets. Without open-sourced code, standardized evaluation scripts for the proposed steerability benchmark, and detailed ablation studies on architectural choices, full reproducibility remains uncertain. The community adoption of the benchmark will heavily depend on the release of a well-documented evaluation suite.
Early fusion via cross-attention introduces additional inference latency and parameter overhead compared to pure late-fusion pipelines, which may hinder deployment in resource-constrained settings. The steering mechanism's efficacy is inherently tied to the semantic alignment of the textual prompt with the frozen backbone's latent space, making it vulnerable to failure on abstract, compositional, or out-of-vocabulary concepts. Furthermore, steering specific visual concepts may inadvertently suppress semantically entangled features, leading to unintended representation collapse in downstream tasks. The frozen backbone assumption also limits the method's ability to adapt to radically novel visual domains without risking catastrophic forgetting or distribution shift.
Steerable visual representations could significantly streamline downstream vision pipelines by enabling fine-grained, language-guided control without full model fine-tuning, with direct applications in medical imaging analysis, industrial quality inspection, and personalized robotics. By decoupling generic feature extraction from task-specific guidance, the approach promotes more modular and efficient AI systems. However, the ability to selectively amplify or suppress visual features raises ethical considerations regarding potential misuse in biased representation engineering, targeted surveillance, or manipulation of automated perception systems. Transparent documentation of steering boundaries and failure modes will be essential for responsible deployment.
Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agent, a unified multimodal agent for world-grounded image synthesis, which reframes image generation as an agentic pipeline consisting of prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis. To train our model, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis, enabling effective supervision over the full agentic generation process. We further introduce FactIP, a benchmark covering 12 categories of culturally significant and long-tail factual concepts that explicitly requires external knowledge grounding. Extensive experiments show that our proposed Unify-Agent substantially improves over its base unified model across diverse benchmarks and real-world generation tasks, while approaching the world knowledge capabilities of the strongest closed-source models. As an early exploration of agent-based modeling for world-grounded image synthesis, our work highlights the value of tightly coupling reasoning, searching, and generation for reliable open-world agentic image synthesis.
Primary: Unknown
All Institutions: Unknown
Unify-Agent introduces a structured agentic pipeline and a dedicated factual benchmark to ground image synthesis in external world knowledge, offering a practical and empirically validated pathway toward more reliable, fact-aware multimodal generation. The work demonstrates that tightly coupling reasoning, search, and recaptioning significantly mitigates the hallucination and long-tail knowledge gaps of unified models, though the approach trades inference efficiency for factual controllability and relies on established compositional paradigms rather than introducing fundamental algorithmic breakthroughs.
The paper proposes a four-stage agentic pipeline (prompt understanding, multimodal evidence searching, grounded recaptioning, final synthesis) to decouple knowledge retrieval from image generation, directly addressing the frozen parametric knowledge bottleneck in unified multimodal models. This compositional architecture is methodologically sound and aligns with recent trends in retrieval-augmented and tool-augmented generation. The curation of 143K high-quality agent trajectories for supervised fine-tuning demonstrates careful data engineering, though the exact filtering criteria, trajectory annotation protocols, and quality assurance mechanisms would require full-text verification. The approach effectively reframes generation as a sequential reasoning-and-verification process rather than a single forward pass, trading architectural elegance for modularity and factual controllability.
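The four-stage decomposition can be sketched as plain function composition. Each stage below is a hypothetical stand-in (the real system backs each step with the trained unified model and a live search tool):

```python
def understand(prompt):
    # Extract entities that likely need external grounding (toy heuristic)
    return {"prompt": prompt, "entities": [w for w in prompt.split() if w.istitle()]}

def search_evidence(parsed, knowledge_base):
    # Stand-in for multimodal evidence search against external sources
    return {e: knowledge_base.get(e, "no evidence found") for e in parsed["entities"]}

def recaption(parsed, evidence):
    # Rewrite the prompt with retrieved facts attached
    facts = "; ".join(f"{k}: {v}" for k, v in evidence.items())
    return f"{parsed['prompt']} (grounded facts: {facts})"

def synthesize(grounded_caption):
    # Stand-in for the final image-generation call
    return f"<image conditioned on: {grounded_caption!r}>"

def unify_agent_pipeline(prompt, knowledge_base):
    parsed = understand(prompt)
    evidence = search_evidence(parsed, knowledge_base)
    return synthesize(recaption(parsed, evidence))

kb = {"Hanbok": "traditional Korean attire with jeogori jacket and chima skirt"}
print(unify_agent_pipeline("A child wearing Hanbok at a festival", kb))
```

One payoff of the decomposition is inspectability: each intermediate output can be checked, so a retrieval failure can be caught and traced before it propagates into synthesis.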
The introduction of the FactIP benchmark (12 categories of culturally significant and long-tail factual concepts) is a strong empirical contribution that explicitly targets the evaluation gap in world-grounded synthesis. The reported improvements over the base unified model and competitive performance against closed-source systems suggest rigorous validation across standard and domain-specific metrics. However, without access to full ablation studies, compute budgets, or latency/throughput measurements, it is difficult to quantify the efficiency-accuracy trade-off inherent in multi-step agentic pipelines. The evaluation would benefit from explicit comparisons to simpler RAG baselines and end-to-end fine-tuning to isolate the marginal gain of the agentic formulation.
The release of a 143K trajectory dataset and the FactIP benchmark substantially improves reproducibility and provides a valuable resource for the community. However, the absence of explicit code repositories, training hyperparameters, optimizer configurations, and hardware specifications in the provided text limits immediate replication. Standard practices for unified model training (e.g., parameter-efficient fine-tuning, specific vision-language backbone choices, diffusion vs. autoregressive decoders) are implied but not detailed, which is a common shortcoming in early-stage arXiv submissions.
The sequential agentic pipeline inherently introduces significant inference latency and computational overhead compared to single-pass generation models. Reliance on external search tools creates vulnerability to retrieval failures, paywalled content, or outdated indices, which can propagate errors into the recaptioning and synthesis stages. The 143K dataset, while substantial, may not fully capture the long-tail distribution of global cultural or rapidly evolving factual concepts. Additionally, the approach likely struggles with highly abstract or stylistically driven prompts where factual grounding is secondary to creative expression.
This work meaningfully advances the integration of dynamic knowledge retrieval with generative modeling, offering practical utility for educational content creation, historical/cultural visualization, and professional design workflows requiring factual accuracy. The agentic paradigm also highlights a broader shift toward verifiable, traceable generative systems. However, it raises important considerations around the automated generation of culturally sensitive or historically contested imagery, necessitating robust safety alignment, provenance tracking, and transparent evidence citation to mitigate misinformation risks.
Realistic reconstruction of dynamic 4D scenes from monocular videos is essential for understanding the physical world. Despite recent progress in neural rendering, existing methods often struggle to recover accurate 3D geometry and temporally consistent motion in complex environments. To address these challenges, we propose MotionScale, a 4D Gaussian Splatting framework that scales efficiently to large scenes and extended sequences while maintaining high-fidelity structural and motion coherence. At the core of our approach is a scalable motion field parameterized by cluster-centric basis transformations that adaptively expand to capture diverse and evolving motion patterns. To ensure robust reconstruction over long durations, we introduce a progressive optimization strategy comprising two decoupled propagation stages: 1) A background extension stage that adapts to newly visible regions, refines camera poses, and explicitly models transient shadows; 2) A foreground propagation stage that enforces motion consistency through a specialized three-stage refinement process. Extensive experiments on challenging real-world benchmarks demonstrate that MotionScale significantly outperforms state-of-the-art methods in both reconstruction quality and temporal stability. Project page: https://hrzhou2.github.io/motion-scale-web/.
MotionScale introduces a scalable, cluster-based motion parameterization and a decoupled progressive optimization pipeline that significantly improves the temporal consistency and geometric fidelity of 4D Gaussian Splatting for large-scale, long-duration monocular video reconstruction. The work directly tackles the compounding drift, memory bottlenecks, and temporal incoherence that have historically restricted dynamic neural rendering to short clips or highly constrained environments. By separating background expansion from foreground motion refinement and leveraging adaptive basis transformations, the authors provide a practical, memory-efficient pathway to scale Gaussian-based dynamic reconstruction beyond toy datasets. While the methodological advances are well-executed and empirically robust, the approach remains an evolutionary refinement of the 3D/4D Gaussian Splatting paradigm rather than a foundational architectural shift. It builds effectively on the representational breakthroughs of NeRF and 3DGS but does not introduce a new learning paradigm, cross-modal capability, or theoretical framework that would warrant field-wide redefinition. Consequently, the paper will likely serve as a highly cited baseline for dynamic scene modeling and downstream video generation tasks, offering substantial practical value for researchers tackling real-world monocular reconstruction without crossing the threshold into transformative, once-a-decade impact.
Thousands of diverse benchmarks have been developed to measure the quality of large language models (LLMs). Yet prior work has demonstrated that LLM performance is often sufficiently explained by a small set of latent factors, or abilities. This suggests the potential for more efficient and principled benchmarking, but it remains difficult to compare the quality of different methods. Motivated by predictive validity, we argue that the quality of a benchmarking framework should be grounded in how efficiently it enables the prediction of model performance on unseen tasks. To analyze this objective, we collect the "Wide-scale Item Level Dataset" (WILD), a dataset of item-model response pairs, comprising evaluations of 65 models on 109,564 unique items spanning 163 tasks drawn from 27 datasets. This dataset enables the first analysis of how different techniques can predict a model's performance on a large, diverse collection of unseen tasks under different budget constraints. We demonstrate that combining a modified multidimensional item response theory (IRT) model with adaptive item selection driven by optimal experimental design can predict performance on 112 held-out benchmark tasks with a mean absolute error (MAE) of less than 7%, and can do so after observing only 16 items. We further demonstrate that incorporating cost-aware discount factors into our selection criteria can reduce the total tokens needed to reach 7% MAE from 141,000 tokens to only 22,000, an 85% reduction in evaluation cost.
The paper introduces a cost-aware, adaptive evaluation framework that combines multidimensional item response theory with optimal experimental design, supported by a large-scale item-level dataset, to efficiently predict LLM capabilities across diverse unseen benchmarks. This work addresses a critical bottleneck in modern language model development by replacing exhaustive, static testing with a principled statistical approach that dynamically selects the most informative evaluation items. While psychometric models have previously been adapted for AI assessment, the integration of adaptive experimental design and explicit token-cost optimization represents a meaningful methodological advance that directly tackles benchmark saturation and computational waste. The accompanying dataset provides unprecedented granularity for analyzing latent ability structures and cross-task generalization, offering a valuable resource for the community. Compared to traditional evaluation paradigms that rely on fixed, monolithic test suites, this framework enables researchers to allocate computational resources more strategically during model development and comparison. The approach is technically rigorous and highly practical, though it extends established statistical machinery rather than proposing a novel learning architecture or training paradigm. Its focus on predictive validity and efficiency ensures strong relevance for both academic researchers and industry teams seeking scalable evaluation pipelines, positioning it as a solid contribution to the evolving landscape of LLM assessment.
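The core mechanism, item response theory driven by optimal experimental design, can be illustrated in the unidimensional 2PL case: repeatedly administer the item carrying maximum Fisher information at the current ability estimate, then re-estimate. Item parameters, the ability grid, and the simulation below are made up for illustration; the paper's model is multidimensional and cost-aware.

```python
import math
import random

def p_correct(theta, a, b):
    # 2PL item response model: discrimination a, difficulty b
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_info(theta, a, b):
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def mle_theta(responses, grid):
    # Grid-search maximum likelihood over ability values
    def loglik(theta):
        return sum(math.log(p_correct(theta, a, b)) if y
                   else math.log(1.0 - p_correct(theta, a, b))
                   for (a, b), y in responses)
    return max(grid, key=loglik)

def adaptive_test(items, true_theta, n_items=16, seed=0):
    rng = random.Random(seed)
    grid = [i / 10 for i in range(-30, 31)]
    theta_hat, responses, pool = 0.0, [], list(items)
    for _ in range(n_items):
        # Optimal design step: administer the most informative remaining item
        item = max(pool, key=lambda it: fisher_info(theta_hat, *it))
        pool.remove(item)
        y = rng.random() < p_correct(true_theta, *item)
        responses.append((item, y))
        theta_hat = mle_theta(responses, grid)
    return theta_hat

items = [(1.5, b / 4) for b in range(-10, 11)]  # (discrimination, difficulty) pairs
print(f"estimated ability after 16 items: {adaptive_test(items, true_theta=1.0):.2f}")
```

A cost-aware variant along the lines the abstract describes would discount each item's information by its token cost before taking the argmax, trading a little statistical efficiency for a large reduction in evaluation spend.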
Large language models (LLMs) are increasingly deployed in high-stakes settings, yet they frequently violate contextual privacy by disclosing private information in situations where humans would exercise discretion. This raises a fundamental question: do LLMs internally encode contextual privacy norms, and if so, why do violations persist? We present the first systematic study of contextual privacy as a structured latent representation in LLMs, grounded in contextual integrity (CI) theory. Probing multiple models, we find that the three norm-determining CI parameters (information type, recipient, and transmission principle) are encoded as linearly separable and functionally independent directions in activation space. Despite this internal structure, models still leak private information in practice, revealing a clear gap between concept representation and model behavior. To bridge this gap, we introduce CI-parametric steering, which independently intervenes along each CI dimension. This structured control reduces privacy violations more effectively and predictably than monolithic steering. Our results demonstrate that contextual privacy failures arise from misalignment between representation and behavior rather than missing awareness, and that leveraging the compositional structure of CI enables more reliable contextual privacy control, shedding light on potential improvement of contextual privacy understanding in LLMs.
The paper demonstrates that LLMs internally encode contextual privacy norms as linearly separable, independent directions and introduces a structured, theory-grounded steering method that outperforms monolithic interventions. This work makes a meaningful contribution to representation engineering and AI safety by reframing privacy failures as a misalignment between latent knowledge and behavioral output rather than a lack of conceptual awareness. By grounding the analysis in Contextual Integrity theory, the authors move beyond ad-hoc probing and propose a compositional steering framework that offers more predictable control over sensitive information disclosure. While the approach builds on established linear probing and activation steering techniques, its systematic decomposition of privacy into orthogonal parameters provides a clear methodological advance for structured representation manipulation. The findings are likely to influence how researchers design safety interventions, shifting focus from monolithic suppression to targeted, dimension-specific alignment. However, as a recent preprint with limited external validation so far, its broader field-wide impact will depend on reproducibility across diverse architectures and real-world deployment scenarios, keeping it within the range of a strong, specialized contribution rather than a paradigm-shifting breakthrough.
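The dimension-wise intervention can be sketched as below, assuming the three CI parameters correspond to precomputed direction vectors at some intervention layer. The variable names, the additive intervention form, and the per-dimension strengths are illustrative assumptions, not the authors' exact method.

```python
import numpy as np

def ci_steer(hidden, directions, alphas):
    """Shift a hidden state independently along each CI direction.

    hidden:     (d,) activation vector at the intervention layer
    directions: dict of direction vectors, e.g. {"info_type": v1,
                "recipient": v2, "transmission_principle": v3}
                (hypothetical names for the three CI parameters)
    alphas:     per-dimension steering strengths (illustrative values)
    """
    steered = hidden.copy()
    for name, v in directions.items():
        v = v / np.linalg.norm(v)          # keep each direction unit-norm
        steered += alphas.get(name, 0.0) * v
    return steered
```

Because each CI parameter gets its own direction and strength, the recipient dimension can be adjusted without disturbing the information-type or transmission-principle components, which is what "compositional" steering buys over a single monolithic privacy vector.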
Existing humanoid table tennis systems remain limited by their reliance on external sensing and their inability to achieve agile whole-body coordination for precise task execution. These limitations stem from two core challenges: achieving low-latency and robust onboard egocentric perception under fast robot motion, and obtaining sufficiently diverse task-aligned strike motions for learning precise yet natural whole-body behaviors. In this work, we present SMASH, a modular system for agile humanoid table tennis that unifies scalable whole-body skill learning with onboard egocentric perception, eliminating the need for external cameras during deployment. Our work advances prior humanoid table-tennis systems in three key aspects. First, we achieve agile and precise ball interaction with tightly coordinated whole-body control, rather than relying on decoupled upper- and lower-body behaviors. This enables the system to exhibit diverse strike motions, including explosive whole-body smashes and low crouching shots. Second, by augmenting and diversifying strike motions with a generative model, our framework benefits from scalable motion priors and produces natural, robust striking behaviors across a wide workspace. Third, to the best of our knowledge, we demonstrate the first humanoid table-tennis system capable of consecutive strikes using onboard sensing alone, despite the challenges of low-latency perception, ego-motion-induced instability, and limited field of view. Extensive real-world experiments demonstrate stable and precise ball exchanges under high-speed conditions, validating scalable, perception-driven whole-body skill learning for dynamic humanoid interaction tasks.
Primary: The University of Hong Kong
All Institutions: The University of Hong Kong, Kinetix AI
SMASH introduces a scalable, perception-driven whole-body control framework that combines generative motion augmentation, task-conditioned motion matching, and egocentric vision to enable the first outdoor, onboard-only humanoid table tennis system with diverse, agile striking behaviors. The work demonstrates strong system-level engineering and practical RL/imitation integration, though its algorithmic novelty is incremental and its impact remains primarily confined to robotics and dynamic humanoid control rather than foundational machine learning.
The paper presents a well-integrated pipeline that addresses two critical bottlenecks in dynamic humanoid control: sparse motion data and reliance on external perception. The core methodological contribution lies in the scalable motion generation and matching framework. Training a conditional Motion-VAE with task-aligned regularizers (phase consistency, temporal smoothness, foot penetration penalty) and subsequently filtering outputs through a physics-aware tracker is a pragmatic and effective approach to bridge the gap between sparse human demonstrations and robot-executable priors. The decision to use task-conditioned nearest-neighbor motion matching rather than hierarchical skill learning or adversarial priors simplifies the training loop while maintaining strong task alignment. The RL formulation (PPO with asymmetric critic, gated impact-window rewards, and adaptive region/sigma scheduling) demonstrates careful engineering for sim-to-real transfer. The perception stack, while largely composed of established components (YOLO, HSV segmentation, stereo triangulation, AprilTag PnP, Adaptive EKF), is tightly coupled to the control loop and handles high-speed dynamics and ego-motion robustly.
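The task-conditioned nearest-neighbor motion matching step might look like the following sketch, where the task feature (e.g. a predicted strike point plus a phase variable) and the Euclidean metric are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def match_motion(task_feature, clip_features, clip_ids):
    """Return the id of the stored motion clip whose task feature is
    nearest to the query.

    task_feature:  (k,) query, e.g. predicted strike point + phase
                   (hypothetical feature choice)
    clip_features: (n, k) features of candidate clips, e.g. outputs of
                   the generative augmentation stage after tracker filtering
    clip_ids:      list of n clip identifiers
    """
    dists = np.linalg.norm(clip_features - task_feature, axis=1)
    return clip_ids[int(np.argmin(dists))]
```

The appeal of this formulation, as the review notes, is its simplicity: there is no hierarchical skill policy or adversarial prior to train, and the matched clip directly supplies a task-aligned reference for the tracking policy.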
The experimental design is thorough and appropriately structured. Simulation ablations cleanly isolate the contributions of data scale, tracker filtering, and adaptive training techniques, showing clear performance trends. The comparison against PPO, Mimic, and HITTER baselines effectively highlights the necessity of whole-body coordination and motion priors for this task. Real-world validation is a strong point: the system successfully executes diverse strikes (smashes, crouching saves, lateral movements) and achieves the claimed milestone of outdoor consecutive rallies using only onboard sensing. The perception error analysis as a function of time-to-strike provides valuable insight into the system's operational envelope. However, quantitative real-world success rates, rally length distributions, and failure mode statistics are underreported, leaving some ambiguity about long-term robustness.
The paper provides clear algorithmic descriptions, reward formulations, and observation structures, which are sufficient for conceptual replication. However, exact reproducibility is hindered by missing details: specific neural network architectures, hyperparameter schedules, simulation environment configurations, and compute budgets are not fully disclosed. The reliance on specific hardware (Unitree G1, ZED cameras, custom MoCap setup) and proprietary simulation pipelines will require significant engineering effort to replicate. The perception pipeline's distance-adaptive noise modeling and EKF reset logic are well-documented, which aids implementation.
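The distance-adaptive noise idea mentioned above can be sketched as a single Kalman measurement update in which the measurement variance grows with the ball's distance; the quadratic scaling law and the constant-velocity state layout are assumptions for illustration, not the paper's documented model.

```python
import numpy as np

def kf_update(x, P, z, H, r0, distance):
    """One Kalman measurement update with distance-adaptive noise.

    x, P:     state mean and covariance (e.g. position + velocity)
    z:        measurement (e.g. triangulated ball position)
    H:        measurement matrix
    r0:       base measurement variance
    distance: current ball distance; variance grows ~ distance^2
              (illustrative scaling assumption)
    """
    R = r0 * (1.0 + distance ** 2) * np.eye(len(z))   # adaptive noise
    S = H @ P @ H.T + R                               # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)                    # Kalman gain
    x_new = x + K @ (z - H @ x)                       # corrected state
    P_new = (np.eye(len(x)) - K @ H) @ P              # corrected covariance
    return x_new, P_new
```

With this form, far-away (hence noisier) stereo measurements are automatically down-weighted, which matches the reported behavior of perception error shrinking as time-to-strike decreases.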
The system is highly task-specific; generalizing the motion generation and matching pipeline to other dynamic manipulation or locomotion tasks remains unproven. The tracker-based filtering step, while crucial for dynamic feasibility, introduces a computational bottleneck and requires a pre-trained tracking policy, adding training complexity. The egocentric perception system, though robust in controlled outdoor settings, will likely degrade under severe lighting changes, ball occlusion, or highly unpredictable opponent play. The paper lacks quantitative metrics on long-term rally stability, recovery from missed strikes, and computational latency breakdowns across the perception-planning-control loop.
This work represents a meaningful step toward fully autonomous, deployable humanoid systems capable of high-speed dynamic interaction without external infrastructure. The scalable motion augmentation and task-aligned matching framework offers a practical blueprint for overcoming data scarcity in whole-body skill learning, with potential applications in sports robotics, agile manipulation, and human-robot collaboration. By demonstrating robust onboard perception coupled with expressive whole-body control, the paper helps bridge the gap between simulation-trained policies and real-world dynamic deployment.